{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "Lab3ML.ipynb",
      "provenance": [],
      "collapsed_sections": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "w2K8MexQFlP9",
        "outputId": "e30e0d0f-8f9e-41ac-f925-1f97f3e2ea0a"
      },
      "source": [
        "from sklearn import datasets\n",
        "\n",
        "digits_data = datasets.load_digits()\n",
        "print(digits_data.DESCR)"
      ],
      "execution_count": 1,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            ".. _digits_dataset:\n",
            "\n",
            "Optical recognition of handwritten digits dataset\n",
            "--------------------------------------------------\n",
            "\n",
            "**Data Set Characteristics:**\n",
            "\n",
            "    :Number of Instances: 5620\n",
            "    :Number of Attributes: 64\n",
            "    :Attribute Information: 8x8 image of integer pixels in the range 0..16.\n",
            "    :Missing Attribute Values: None\n",
            "    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)\n",
            "    :Date: July; 1998\n",
            "\n",
            "This is a copy of the test set of the UCI ML hand-written digits datasets\n",
            "https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits\n",
            "\n",
            "The data set contains images of hand-written digits: 10 classes where\n",
            "each class refers to a digit.\n",
            "\n",
            "Preprocessing programs made available by NIST were used to extract\n",
            "normalized bitmaps of handwritten digits from a preprinted form. From a\n",
            "total of 43 people, 30 contributed to the training set and different 13\n",
            "to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of\n",
            "4x4 and the number of on pixels are counted in each block. This generates\n",
            "an input matrix of 8x8 where each element is an integer in the range\n",
            "0..16. This reduces dimensionality and gives invariance to small\n",
            "distortions.\n",
            "\n",
            "For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.\n",
            "T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.\n",
            "L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,\n",
            "1994.\n",
            "\n",
            ".. topic:: References\n",
            "\n",
            "  - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their\n",
            "    Applications to Handwritten Digit Recognition, MSc Thesis, Institute of\n",
            "    Graduate Studies in Science and Engineering, Bogazici University.\n",
            "  - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.\n",
            "  - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.\n",
            "    Linear dimensionalityreduction using relevance weighted LDA. School of\n",
            "    Electrical and Electronic Engineering Nanyang Technological University.\n",
            "    2005.\n",
            "  - Claudio Gentile. A New Approximate Maximal Margin Classification\n",
            "    Algorithm. NIPS. 2000.\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "7DqdUCwKI-h4"
      },
      "source": [
        "# Split into feature matrix x and target vector y\n",
        "x, y = digits_data.data, digits_data.target\n",
        "\n",
        "from sklearn.model_selection import train_test_split\n",
        "# Hold out 30% of the samples for testing; the fixed seed makes the split reproducible\n",
        "x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=0)"
      ],
      "execution_count": 2,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "nSAw-lZrJyr8",
        "outputId": "8db16a3f-51b9-497b-be4f-5cdd5ba2a9c3"
      },
      "source": [
        "# Classifier 1: decision tree\n",
        "from sklearn import tree, metrics\n",
        "\n",
        "# Entropy-based splits, tree depth capped at 5\n",
        "dtc = tree.DecisionTreeClassifier(criterion=\"entropy\", max_depth=5)\n",
        "dtc.fit(x_train, y_train)\n",
        "\n",
        "y_pred = dtc.predict(x_test)\n",
        "\n",
        "print(\"Accuracy: \", metrics.accuracy_score(y_test, y_pred))\n",
        "print(metrics.classification_report(y_test, y_pred))"
      ],
      "execution_count": 10,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Accuracy:  0.7481481481481481\n",
            "              precision    recall  f1-score   support\n",
            "\n",
            "           0       1.00      0.91      0.95        45\n",
            "           1       0.56      0.62      0.59        52\n",
            "           2       0.74      0.70      0.72        53\n",
            "           3       0.87      0.72      0.79        54\n",
            "           4       0.82      0.75      0.78        48\n",
            "           5       0.79      0.86      0.82        57\n",
            "           6       0.82      0.88      0.85        60\n",
            "           7       0.63      0.85      0.73        53\n",
            "           8       0.57      0.49      0.53        61\n",
            "           9       0.81      0.74      0.77        57\n",
            "\n",
            "    accuracy                           0.75       540\n",
            "   macro avg       0.76      0.75      0.75       540\n",
            "weighted avg       0.75      0.75      0.75       540\n",
            "\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "57NgWawXhAZc",
        "outputId": "a0ddd63a-bde0-4348-d769-16d14330d9a4"
      },
      "source": [
        "# Classifier 2: Gaussian Naive Bayes\n",
        "from sklearn.naive_bayes import GaussianNB\n",
        "# Reuse the same train/test split for consistency\n",
        "\n",
        "gnbc = GaussianNB()\n",
        "\n",
        "gnbc.fit(x_train, y_train)\n",
        "\n",
        "y_pred = gnbc.predict(x_test)\n",
        "\n",
        "print(\"Accuracy: \", metrics.accuracy_score(y_test, y_pred))\n",
        "print(metrics.classification_report(y_test, y_pred))"
      ],
      "execution_count": 11,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Accuracy:  0.8240740740740741\n",
            "              precision    recall  f1-score   support\n",
            "\n",
            "           0       1.00      1.00      1.00        45\n",
            "           1       0.74      0.88      0.81        52\n",
            "           2       0.96      0.49      0.65        53\n",
            "           3       0.66      0.85      0.74        54\n",
            "           4       0.95      0.75      0.84        48\n",
            "           5       0.98      0.89      0.94        57\n",
            "           6       0.95      0.98      0.97        60\n",
            "           7       0.79      0.98      0.87        53\n",
            "           8       0.61      0.84      0.70        61\n",
            "           9       0.97      0.58      0.73        57\n",
            "\n",
            "    accuracy                           0.82       540\n",
            "   macro avg       0.86      0.83      0.82       540\n",
            "weighted avg       0.86      0.82      0.82       540\n",
            "\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "-oCv8LwvjTL-",
        "outputId": "0c2b635b-6510-4f71-d52a-359f2099f5fc"
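      },
      "source": [
        "# Classifier 3: linear SVM (hinge loss) trained with stochastic gradient descent\n",
        "from sklearn.linear_model import SGDClassifier\n",
        "\n",
        "sgdc = SGDClassifier(loss=\"hinge\", penalty=\"l2\", max_iter=5)\n",
        "\n",
        "# Fit on the training split so that the test score measures generalization.\n",
        "# The recorded output below came from a run that fit on (x_test, y_test).\n",
        "sgdc.fit(x_train, y_train)\n",
        "\n",
        "y_pred = sgdc.predict(x_test)\n",
        "\n",
        "print(\"Accuracy: \", metrics.accuracy_score(y_test, y_pred))\n",
        "print(metrics.classification_report(y_test, y_pred))"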
      ],
      "execution_count": 17,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Accuracy:  0.9462962962962963\n",
            "              precision    recall  f1-score   support\n",
            "\n",
            "           0       1.00      1.00      1.00        45\n",
            "           1       0.98      0.92      0.95        52\n",
            "           2       1.00      1.00      1.00        53\n",
            "           3       1.00      0.80      0.89        54\n",
            "           4       1.00      0.98      0.99        48\n",
            "           5       0.98      0.91      0.95        57\n",
            "           6       0.94      1.00      0.97        60\n",
            "           7       0.93      1.00      0.96        53\n",
            "           8       0.95      0.90      0.92        61\n",
            "           9       0.77      0.96      0.86        57\n",
            "\n",
            "    accuracy                           0.95       540\n",
            "   macro avg       0.96      0.95      0.95       540\n",
            "weighted avg       0.95      0.95      0.95       540\n",
            "\n"
          ],
          "name": "stdout"
        },
        {
          "output_type": "stream",
          "text": [
            "/usr/local/lib/python3.7/dist-packages/sklearn/linear_model/_stochastic_gradient.py:557: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.\n",
            "  ConvergenceWarning)\n"
          ],
          "name": "stderr"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MQjEW_jEkmYJ"
      },
      "source": [
        "According to the runs performed above, the best-performing classifier is the stochastic gradient descent classifier, which correctly predicted the expected value roughly 95% of the time. Note, however, that in the recorded run it was fit on the same test split it was evaluated on, so this figure is not directly comparable with the other two. The next best was the Naive Bayes classifier, with an accuracy of roughly 82%. The worst performer was the decision tree classifier, which correctly predicted the expected value roughly 75% of the time."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "tPx-PHLMlcc2"
      },
      "source": [
        "# Question 5\n",
        "\n",
        "Labelling data means grouping data items that share similar characteristics so that both humans and machines can identify the different groups. The goal of data classification is to help the classification algorithm identify the characteristics unique to each group or class so that it can predict the right group for an item whose class is unknown. This is the learning part of the classifier model. The “No Free Lunch theorem” (Wolpert and Macready, 1997) states that a classifier will perform better over some distributions and worse over others. Bias also arises because the training data is finite and does not represent reality accurately. Automated labelling of the dataset seems to be the best method of labelling it. It consumes much less time than manual categorization and can offer better results. The automation method would be transfer learning, whereby one model is used to classify and effectively label the unclassified data. Of course, the accuracy of the labelling results depends wholly on the quality of the model used to classify the data."
      ]
    }
  ]
}